Add GPU backends to distributed CI pipeline #1012
|
cscs-ci run distributed |
f310db9 to
fb3927e
fb3927e to
1fd6389
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
This reverts commit e30c2f7.
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
ab4ac8f to
8f04d36
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
Make revert_repeated_index_to_invalid numpy-only as it's not usefully vectorized
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
| [tool.uv.sources] | ||
| dace = {index = "gridtools"} | ||
| ghex = {git = "https://github.com/msimberg/GHEX.git", branch = "async-mpi"} | ||
| ghex = {git = "https://github.com/philip-paul-mueller/GHEX.git", branch = "phimuell__async-mpi-2"} |
This is updated because ghex-org/GHEX#190 contains a bugfix to how strides are computed for GPU buffers. Tests fail with master and async-mpi. We should get ghex-org/GHEX#190 merged ASAP to be able to use GHEX master here.
|
cscs-ci run distributed |
|
This is ready for reviews, but not ready for merging due to the GHEX update. |
|
sorry ;-) |
|
Mandatory Tests Please make sure you run these tests via comment before you merge!
Optional Tests To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe. |
|
cscs-ci run distributed |
|
cscs-ci run default |
|
| I've marked this as a draft until #980 is merged. #980 should also update the ghex commit to one that is new enough for this PR. |
Adds gtfn_gpu and dace_gpu backends to the distributed CI pipeline.
The base image is upgraded because it's possible, but not strictly necessary. The CPU-only version of the pipeline needed 25.04 (24.04 and 25.10 did not work for various reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint. 24.04 doesn't have matching GCC/CUDA versions and 26.04 doesn't exist yet, but the pipeline should eventually use 26.04.
OpenMPI and libfabric are built manually for Slingshot support because getting the Ubuntu repository packages to work with GPU support did not seem possible, or at least not easy. The installation is based on https://github.com/eth-cscs/cray-network-stack.
GHEX needs an upgrade, because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190 but we should wait for that to be merged (and probably test in icon-exclaim first).
This also fixes a few cupy/numpy incompatibilities.
- revert_repeated_index_to_invalid was updated to only deal with numpy for now, as the connectivities are always numpy arrays.
- test_halo_exchange_for_sparse_field is marked embedded_only. The non-MPI test was already marked embedded-only.

This does not try to unify the default and distributed CI pipeline definitions. That should, however, be done sooner or later as well.
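
To illustrate the numpy-only approach mentioned for revert_repeated_index_to_invalid, here is a minimal sketch. It is not the code from this PR. It assumes the function's job is to keep the first occurrence of each index in a connectivity row and overwrite later repeats with an invalid marker. The sentinel `INVALID_INDEX = -1`, the helper `to_numpy`, and the `_sketch` function name are all hypothetical. The helper relies only on the fact that cupy arrays expose `.get()` for an explicit device-to-host copy, since `np.asarray` refuses to convert a cupy array implicitly.

```python
import numpy as np

INVALID_INDEX = -1  # assumption: sentinel marking "no neighbour"; hypothetical name


def to_numpy(array):
    """Return a host-side numpy array for numpy-like or cupy input (hypothetical helper)."""
    # cupy.ndarray provides .get() to copy device memory to the host;
    # np.asarray on a cupy array raises TypeError rather than converting.
    if hasattr(array, "get"):
        return array.get()
    return np.asarray(array)


def revert_repeated_index_to_invalid_sketch(offset: np.ndarray) -> np.ndarray:
    """Assumed semantics: within each row, keep the first occurrence of an
    index and replace every later repeat with INVALID_INDEX."""
    result = offset.copy()  # leave the caller's table untouched
    for row in result:  # each row is a view, so writes land in `result`
        seen = set()
        for j, value in enumerate(row):
            index = int(value)
            if index == INVALID_INDEX:
                continue  # already invalid, nothing to revert
            if index in seen:
                row[j] = INVALID_INDEX
            else:
                seen.add(index)
    return result


table = to_numpy(np.array([[3, 5, 5], [7, 8, 9], [2, 2, 2]]))
cleaned = revert_repeated_index_to_invalid_sketch(table)
```

The row-by-row loop is data dependent, which is one plausible reading of why the function is "not usefully vectorized": running it through cupy would mostly add host-device transfers without speeding anything up.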